ABSTRACT

ccPDB (http://crdd.osdd.net/raghava/ccpdb/) is a 
database of data sets compiled from the literature 
and Protein Data Bank (PDB). First, we collected 
and compiled data sets from the literature used for 
developing bioinformatics methods to annotate 
the structure and function of proteins. Second, 
data sets were derived from the latest release of 
PDB using standard protocols. Third, we developed 
a powerful module for creating a wide range of 
customized data sets from the current release of 
PDB. This is a flexible module that allows users to 
create data sets using a simple six-step procedure. 
In addition, a number of web services have been 
integrated in ccPDB, which include submission 
of jobs on PDB-based servers, annotation of 
protein structures and generation of patterns. This 
database maintains >30 types of data sets such 
as secondary structure, tight-turns, nucleotide 
interacting residues, metal interacting residues, 
DNA/RNA binding residues and so on. 

INTRODUCTION 

Annotating the structure and function of a protein is 
one of the major challenges in the post-genomic era. 
Development of bioinformatic methods for such annotation 
requires experimentally proven data for training, 
testing and validation. Hence, clean/refined data sets 
(e.g. non-redundant, experimentally validated) are the 
heart of bioinformatic methods. The Protein Data Bank 
(PDB) is one of the major sources of experimentally 
obtained data and contains >74 000 protein structures 
determined using X-ray crystallography, NMR spectroscopy 
and other techniques (1). It plays a vital role in the 
field of protein structure/function annotation, as most 
bioinformatic tools rely on data derived from 
PDB. To assist the protein community, a large 
number of secondary databases have been derived from 
PDB, including SCOP (2), CATH (3), SuperSite (4), 
PDB-ligand (5), PDBsum (6), etc. In addition, a number 
of tools have been developed for extracting useful 
information from PDB and secondary databases, such as DSSP 
(7), PROMOTIF (8), LPC (9) and HBPLUS (10). 

Recently, Joosten et al. (11) described a series of 
PDB-related databases that are heavily used for developing 
bioinformatic techniques. They described mainly the 
databases developed and maintained by their own group, 
which include DSSP (secondary structure of proteins), 
HSSP (12) (multiple sequence alignment), PDBFINDER 
(summaries of PDB files), PDBSELECT (13) (non-redundant 
proteins) and WHY_NOT (explanation as to 
why entries in other databases cannot exist). These databases 
are useful for developing new methods in the field of 
structural bioinformatics. However, additional scripts or 
software are required for extracting or parsing the 
data to obtain clean/refined data sets. Wang and 
Dunbrack (14) developed the PISCES server to cull 
protein sequences/structures from a set of PDB codes or 
FASTA sequences, with an option to select the cutoff for 
sequence identity. This server is specific to a set of protein 
sequences/structures. 

In this work, we developed a database, which is a 
collection of commonly used data sets for structural or 
functional annotation of proteins. We have accumulated 
numerous data sets from the literature, which were 
used for developing methods to annotate proteins at the 
sequence (or residue) level. In order to provide updated 
data sets on the latest release of PDB, we created and 
maintained important data sets from various releases of 
PDB. In addition, for providing customized data sets, we 
also developed a series of web-based tools for creating new 
data sets. These newly created data sets can be used to 
benchmark the existing and newly developed methods 
such as Protein Classification Benchmark Collection (15), 
prediction of protein secondary structures (16), evaluation 
of multiple sequence alignment (17), identification of 
binding sites in DNA/RNA binding proteins (18,19), 
ATP and ligand binding sites (20,21) and so on. 

SYSTEMS AND METHODS 

Data collection and organization 

We extracted most of the bioinformatics-related papers 
from PubMed and other resources, and obtained the 
data sets from supplementary materials, databases, 
websites and/or directly from the authors. These data 
sets were classified by content and are maintained at 
ccPDB. In order to compile data sets, we downloaded all 
PDB files from http://www.pdb.org/. These PDB files are 
mirrored on our server using the rsync command, 
which allows users to create customized data sets from the 
latest release of PDB. We also maintain the DSSP database 
on our server, which provides secondary structure and 
other related information. In ccPDB, we used 
various software packages for deriving useful information 
from PDB. The following are the major software packages 
used in ccPDB: (i) PROMOTIF for identifying structural motifs 
(8); (ii) LPC for generating ligand-protein interaction 
data (9); (iii) PDIdb for identifying amino acid residues 
that interact with DNA/RNA (22,23); and 
(iv) in-house Perl scripts for a wide range of calculations 
and for analyzing PDB files. 
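
As an illustration of the kind of analysis such scripts perform, the following minimal sketch reads the experimental method and resolution from the header records of a PDB-format file. It is written in Python rather than Perl and is not the authors' code; the function name read_pdb_header is hypothetical. It assumes only the standard EXPDTA and REMARK 2 record layout of the PDB file format.

```python
import re


def read_pdb_header(lines):
    """Return (experimental method, resolution in angstroms or None).

    `lines` is any iterable of PDB-format text lines. Entries without a
    numeric resolution (e.g. NMR structures) yield None for resolution.
    """
    method = None
    resolution = None
    for line in lines:
        if line.startswith("EXPDTA"):
            # The method text follows the record name, e.g. "X-RAY DIFFRACTION".
            method = line[10:].strip()
        elif line.startswith("REMARK   2 RESOLUTION."):
            match = re.search(r"(\d+\.\d+)\s+ANGSTROMS", line)
            if match:
                resolution = float(match.group(1))
    return method, resolution


# Hand-written header lines used only for illustration.
example = [
    "EXPDTA    X-RAY DIFFRACTION",
    "REMARK   2 RESOLUTION.    1.80 ANGSTROMS.",
]
print(read_pdb_header(example))
```

The same function can be applied to an open file handle, since it only iterates over lines.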

Database architecture 

ccPDB is built on Apache HTTP server 2.2 with MySQL 
server 5.1.47 as the back end and PHP 5.2.9, HTML 
and JavaScript as the front end. Apache, MySQL and 
PHP were preferred as they are open-source 
and platform independent. 



IMPLEMENTATION 

This is a comprehensive database, which maintains 
existing data sets collected from the literature and 
compiled data sets derived from PDB. In addition, the 
database server allows users to create customized 
data sets. The database is broadly divided into three 
sections, and a brief description of each section is 
given below. 



Collection of data sets 

This section maintains published data sets that were used 
for developing prediction methods. These data sets were 
collected from the literature after an extensive search. 
These data sets are divided into various categories as 
described below: 

• Protein secondary structure: in this category, we 
maintain data sets used for developing secondary 
structure prediction of proteins. 

• Nucleotide interacting residues: it contains data sets 
used for developing prediction methods for DNA or 
RNA interacting residues in proteins. 

• Ligand interacting residues: it maintains data sets used 
for predicting ligand-protein interacting residues 
(e.g. ATP, GTP, FAD, MAN, etc.). 

Compilation of data sets 

Data sets in the section 'Collection of data sets' are useful 
for benchmarking newly developed methods against 
existing methods. For developing a new method, one 
should generate data sets from the latest release of PDB, 
as the performance of a method mainly depends on the 
size of the data set. Hence, data sets generated 
earlier become obsolete as the number of protein 
structures in PDB rapidly increases. In order to reduce 
the burden of developing data sets for the protein community, we 
compile and maintain data sets from the latest release 
(July 2011) of PDB. In addition, we will also maintain 
data sets generated from the previous releases of PDB. 
Data sets compiled from PDB are listed in Table 1 
along with their compilation procedures. 

Creation of data set 

This is a major module of ccPDB developed for creating 
customized data sets. To assist users, we 
developed a six-step procedure for creating customized 
data sets that provides full flexibility in each step 
(Table 2). The following is a brief description of each step. 

• Extract protein chains: this option allows users to 
extract specific chains from PDB, for example, extraction 
of ATP binding protein chains from PDB. Users 
may extract protein chains with a desired structure or 
function. In addition, this option allows users to 
extract PDB chains of a desired function from a list 
of PDB IDs that they provide. 

• General filters: these filters allow users to extract PDB 
chains from the latest release with desired conditions, 



Table 1. Brief description of major data sets created at ccPDB 

Type of data set | Description of data set | Software package 
Secondary structure | Data sets related to secondary structure, helix, strand, coil, etc. | DSSP 
Tight turns | Data sets created for various types of tight turns (e.g. β-turn). | PROMOTIF 
Nucleotide interacting | Data sets created for small nucleotide and metal (e.g. ATP, GTP, Fe, Mg, etc.) binding residues. | LPC 
DNA/RNA binding residues | Data sets of DNA and RNA binding proteins/residues. | PDIdb 
Metal binding residues | Data set of metal binding proteins/residues (e.g. Zn, Ca interacting residues). | LPC 

Table 2. Description of each process/step of data set creation module of ccPDB 

Process/step | Description of process | Example 
Protein chains | Allows users to extract PDB chains having desired structure or function. | Extract ATP binding protein chains from PDB. 
General filters | Extract chains using various filters like resolution, experimental technique. | Protein chains having resolution better than 3 Å, solved by X-ray crystallography. 
Combination of sets | Allows to combine two sets of data. | ATP binding protein chains having resolution better than 3 Å, solved by X-ray crystallography. 
Extract sequences | Extract the amino acid sequences of PDB chains. | Extract sequences of ATP binding proteins. 
Non-redundant data sets | Creation of non-redundant data sets using BLASTClust. | Generate non-redundant data set at 25% of ATP binding proteins. 
Annotation of residues | Assigning structure/function of each residue in PDB chains. | Mapping of ATP interacting residues in ATP binding protein chains. 



for example, users can select protein chains solved 
by X-ray crystallography at a resolution 
better than 2.5 Å. Major filters included in this 
option are (i) experimental method, (ii) resolution 
and (iii) length of amino acid sequence. This option 
also allows users to remove redundancy in extracted 
proteins. 

• Combination of sets: this option allows users to 
generate a new set of protein chains from two sets of 
data using various combinations. For example, it 
allows users to select chains, which are common in 
two sets or unique chains in two sets. This is useful 
for combining sets extracted from the above two steps. 

• Extracted sequences: the above three steps allow users 
to extract protein chains as per their requirement. This 
step allows users to extract amino acid sequences of 
these chains from PDB. 

• Non-redundant sequences: creation of non-redundant 
data set is important for training, testing and 
validating any prediction model. This page provides 
an option to remove the redundant sequences from 
a set of protein sequences. 

• Annotation of residues: this interface allows users to 
create data sets at residue level. For example, users 
can assign secondary structure of each residue in a 
protein. This option is designed to assign ligand/ 
DNA/RNA interacting residues in a protein. 

These six steps help users create different 
types of data sets from PDB. 
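
The 'General filters' and 'Combination of sets' steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions and not the ccPDB implementation: the Chain record, the function names, the chain identifiers and the default cutoffs are all hypothetical, and 'unique chains in two sets' is interpreted here as chains present in exactly one of the two sets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Chain:
    chain_id: str                 # hypothetical identifier, e.g. "1AAA_A"
    method: str                   # experimental method
    resolution: Optional[float]   # in angstroms; None when not applicable
    length: int                   # number of residues


def general_filter(chains, method="X-RAY DIFFRACTION",
                   max_resolution=2.5, min_length=30):
    """Return IDs of chains solved by `method` at better than `max_resolution`."""
    return {
        c.chain_id
        for c in chains
        if c.method == method
        and c.resolution is not None
        and c.resolution < max_resolution
        and c.length >= min_length
    }


def combine(set_a, set_b, mode="common"):
    """Combine two sets of chain IDs.

    "common" keeps chains present in both sets; "unique" keeps chains
    present in exactly one of them (an assumed reading of the text).
    """
    if mode == "common":
        return set_a & set_b
    if mode == "unique":
        return set_a ^ set_b
    raise ValueError(f"unknown mode: {mode!r}")


# Made-up records standing in for parsed PDB chains.
chains = [
    Chain("1AAA_A", "X-RAY DIFFRACTION", 1.9, 250),
    Chain("2BBB_A", "X-RAY DIFFRACTION", 3.2, 180),
    Chain("3CCC_B", "SOLUTION NMR", None, 95),
    Chain("5EEE_A", "X-RAY DIFFRACTION", 2.0, 120),
]
# Made-up result of the 'Extract protein chains' step.
atp_binding = {"1AAA_A", "2BBB_A", "4DDD_A"}

filtered = general_filter(chains)
print(sorted(combine(atp_binding, filtered, "common")))
print(sorted(combine(atp_binding, filtered, "unique")))
```

Representing each step's output as a set of chain identifiers keeps the steps composable: the result of any step can be fed to the combination step or passed on to sequence extraction.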

Web services 

ccPDB provides a number of web services to assist 
PDB users. These services allow users to perform various 
types of tasks, including the analysis of PDB files. 
The following is a brief description of the tools integrated in 
ccPDB. 

Analysis of PDB ID. In the last two decades, a number of 
web-based services have been developed for analyzing 
PDB files. These tools have been developed by various 
groups over the years and are available at different web 
sites. Hence, one has to visit various sites and submit a 
PDB ID to each in order to use these tools. To assist users, we 
developed a web interface that integrates >40 servers, 
so that users can submit their PDB ID to these servers 
from our interface. 

Structure information. This option provides the following 
types of information about a PDB ID: (i) amino acid 
composition of chains, (ii) number and type of ligand/ 
metal interacting residues and (iii) tight-turns in proteins. 
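
As a small illustration of item (i), amino acid composition can be computed as the percentage of each of the 20 standard residues in a chain sequence. The sketch below is an assumption about the calculation, not the code used by ccPDB, and the function name composition is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # one-letter codes of the 20 standard residues


def composition(sequence):
    """Return {residue: percentage} over the 20 standard amino acids.

    Non-standard characters are ignored when computing the total.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return {aa: 100.0 * counts[aa] / total for aa in AMINO_ACIDS}


comp = composition("AAGG")
print(comp["A"], comp["G"], comp["W"])
```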

Search in PDB files. This search module allows users to 
search PDB on major fields such as ligand, organism, 
PDB code, etc. This option also allows users to display 
various types of information such as type of interacting 
residues (ligands/metals), secondary structure (DSSP 
states), tight-turns, amino acid compositions, etc. 

Generate pattern. This allows users to create patterns 
from protein chains in a format suitable for 
various machine learning packages such as 
SVMlight, Weka and SNNS. It allows users to generate 
patterns at the protein level as well as at the residue level. 
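
To make the idea of protein-level and residue-level patterns concrete, the sketch below writes both in the sparse SVMlight input format (a label followed by ascending index:value pairs). The choice of features is an assumption for illustration only: amino acid composition at the protein level and a binary-encoded sliding window at the residue level are common encodings, but the text does not state which encodings ccPDB uses. All function names are hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "X"  # "X" pads windows that run past the chain termini


def svmlight_line(label, features):
    """Format {index: value} as one SVMlight line; zero-valued features are omitted."""
    body = " ".join(f"{i}:{v:g}" for i, v in sorted(features.items()) if v != 0)
    return f"{label:+d} {body}".rstrip()


def protein_pattern(sequence, label):
    """Protein-level pattern: fraction of each standard amino acid (features 1-20)."""
    if not sequence:
        raise ValueError("empty sequence")
    n = len(sequence)
    feats = {i + 1: sequence.count(aa) / n for i, aa in enumerate(AMINO_ACIDS)}
    return svmlight_line(label, feats)


def residue_patterns(sequence, labels, window=3):
    """Residue-level patterns: one binary-encoded window per residue.

    `labels` holds +1/-1 for each residue, e.g. interacting or not.
    """
    if window % 2 == 0:
        raise ValueError("window size must be odd")
    if len(labels) != len(sequence):
        raise ValueError("one label is required per residue")
    half = window // 2
    padded = "X" * half + sequence + "X" * half
    lines = []
    for pos, label in enumerate(labels):
        feats = {}
        for offset, aa in enumerate(padded[pos:pos + window]):
            # Unknown residue codes are treated like padding.
            idx = ALPHABET.index(aa) if aa in ALPHABET else ALPHABET.index("X")
            feats[offset * len(ALPHABET) + idx + 1] = 1
        lines.append(svmlight_line(label, feats))
    return lines


print(protein_pattern("AACD", 1))
for line in residue_patterns("ACD", [1, -1, -1], window=3):
    print(line)
```

Each window position contributes 21 binary features (20 residues plus padding), so a window of size w yields 21w features per residue, of which exactly w are non-zero.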

Download information. This server allows users to 
download PDB files and related information that 
includes PDB, DSSP and PDBFINDER2 files. 



UPDATE OF DATABASE 

This database will be updated manually as well as 
automatically. In order to update the contents of the 'Collection 
of data sets' section, we check the literature for recent 
data sets. We also provide an online submission 
facility that allows users to submit data sets to our 
database. The 'Compilation of data sets' section will be 
updated every 6 months using in-house scripts. 
The 'Creation of data set' section will be updated every 
3 months. 

AVAILABILITY AND REQUIREMENTS 

ccPDB is freely available at http://crdd.osdd.net/raghava/ccpdb/. 